A Survey on text categorization of Indian and non-Indian languages using supervised learning techniques

Authors

  • Khyati S. Kava
  • Nikita P. Desai
Abstract

Text categorization plays an important role in text mining. It is the process of assigning documents to predefined categories. Automatic text categorization is an important task because of the large volume of electronic documents. This paper presents a survey of text categorization for Indian and non-Indian languages. Very little work has been done on text categorization of Indian languages. To extract features from documents, the TF-IDF (term frequency-inverse document frequency) method is most commonly used. The major classifiers used for text categorization are SVM (support vector machine), NB (Naïve Bayes), decision tree, and K-NN (K-nearest neighbor). The measures used to evaluate text categorization performance are recall, precision, and F-measure.

Keywords: Text Categorization, TF-IDF, SVM, NB.
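As a rough illustration of the feature extraction and evaluation measures named in the abstract, the following is a minimal sketch, not the method of any surveyed paper. It assumes one common TF-IDF variant (length-normalized term frequency multiplied by log(N / df)) and computes precision, recall, and F-measure for a single category. The function names `tf_idf` and `precision_recall_f1` and the toy data are hypothetical, chosen only for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    Other weightings (smoothed idf, sublinear tf) are also widely used.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F-measure for one category label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (hypothetical): two tokenized documents and four predictions.
docs = [["cricket", "match", "cricket"], ["election", "match"]]
weights = tf_idf(docs)
scores = precision_recall_f1(
    ["sports", "sports", "politics", "politics"],
    ["sports", "politics", "sports", "politics"],
    positive="sports",
)
```

In this sketch a term that occurs in every document, such as "match", gets weight zero, which is why TF-IDF favors terms that help separate categories. The resulting weight vectors would then be passed to a classifier such as SVM, NB, decision tree, or K-NN.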


Similar resources

Named Entity Recognition for Code Mixing in Indian Languages using Hybrid Approach

Automating Named Entity Recognition in social media text has received a lot of attention over the past few years. Named entities are real-world objects such as Person, Organization, Product, and Location. Identifying these entities in social media text is an important and challenging task due to the informal nature of the text found on social media. One such challenge that is faced in recognizing...


Cost Effective Dependency Parsing for Indian Languages

Indian languages are MoR-FWO and hence differ from English in structure and morphology. Indian languages possess many distinguishing characteristics. While working with these languages, we have to keep these characteristics in mind and plan strategies accordingly. We worked on improving dependency parsing for Indian languages, more specifically for Hindi, an Indo-Aryan language. ...


Different Techniques Implemented in Gurumukhi Word Sense Disambiguation

One of the most important issues in the field of Natural Language Engineering is Word Sense Disambiguation (WSD). Gurumukhi, more commonly known as Punjabi, is the world's 12th most widely spoken language, and it is morphologically rich. Surprisingly, there have been relatively few efforts toward computerization and the development of lexical resources for this language. It is therefo...


A survey on text mining techniques

Text mining is a technique for finding meaningful patterns in available text documents. Pattern discovery from text and the organization of documents are well-known problems in data mining. Analyzing text content and categorizing documents is a complex data mining task. In order to find an efficient and effective technique for text categorization, various techniques ...


A study on efficiency and productivity of Indian non-life insurers using data envelopment analysis

This paper measures the efficiency and productivity of non-life insurance firms in India. The study focuses on twelve private and four public sector non-life insurance firms in India over the period 2008-09 to 2012-13. Data Envelopment Analysis (DEA) coupled with the Malmquist Productivity Index is used in measuring the efficiency as well as productivity...



Publication date: 2015